Abstract



Recently, there has been increasing interest in reporting diagnostic scores. This paper examines 
reporting of subscores using multidimensional item response theory (MIRT) models. An MIRT 
model is fitted using a stabilized Newton-Raphson algorithm (Haberman, 1974, 1988) with 
adaptive Gauss-Hermite quadrature (Haberman, von Davier, & Lee, 2008). A new statistical 
approach is proposed to assess when subscores using the MIRT model have any added value over 
(a) the total score or (b) subscores based on classical test theory (Haberman, 2008; Haberman, 
Sinharay, & Puhan, 2006). The MIRT-based methods are applied to several operational data sets. 
The results show that the subscores based on MIRT are slightly more accurate than subscore 
estimates derived by classical test theory. 
There is an increasing interest in subscores because of their potential diagnostic value. Failing 
candidates want to know their strengths and weaknesses in different content areas to plan for 
future remedial work. States and academic institutions such as colleges and universities often 
want a profile of performance for their graduates to better evaluate their training and focus on 
areas that need instructional improvement (Haladyna & Kramer, 2004). 

Multidimensional item response theory (MIRT) models can be employed to report subscores. 
Several papers have suggested this approach, although current approaches have been somewhat 
problematic in terms of practical application to testing programs with limited time for analysis. 
For instance, de la Torre and Patz (2005) applied an MIRT model to data from tests that measure 
multiple correlated abilities. This method can be used to estimate subscores, although the 
subscores, which are components of the ability vector in the MIRT model, are in the scale of the 
ability parameters rather than in the scale of the raw scores. This approach provided results very 
similar to those based on augmentation of raw subscores (Wainer et al., 2001). Yao and Boughton 
(2007) also examined subscore reporting based on an MIRT model and the Markov-chain 
Monte-Carlo (MCMC) algorithm. However, the MCMC algorithm employed in de la Torre and 
Patz (2005) or Yao and Boughton (2007) is more computationally intensive than is currently 
practical given the time constraints of many testing programs. In addition, determination of 
convergence of an MCMC algorithm is not straightforward for a typical psychometrician working 
for a testing company. Researchers have also compared different approaches, including the 
MIRT-based methods, for reporting subscores. For example, Dwyer, Boughton, Yao, Steffen, and 
Lewis (2006) compared four methods: raw subscores, the objective performance index (OPI) 
described in Yen (1987), Wainer augmentation, and MIRT-based subscores. On the whole, they 
found that the MIRT-based methods and augmentation methods provided the best estimates of 
subscores. 

This paper fits the MIRT model using a stabilized Newton-Raphson algorithm (Haberman, 
1974, 1988) with adaptive Gauss-Hermite quadrature (Haberman, von Davier, & Lee, 2008). In 
typical applications, this algorithm is far faster than the MCMC algorithm, so that methods used 
in this paper can be considered in operational testing. In addition, a new statistical approach 
is proposed to assess when subscores obtained using MIRT have any added value over (a) the 
total score and (b) subscores based on classical test theory. This work extends to MIRT models 
the research of Haberman (2008) and Haberman, Sinharay, and Puhan (2006), who suggested 
methods based on classical test theory (CTT) to examine whether subscores provide any added 
value over total scores. 

Section 1 of this report provides a brief overview of the CTT-based methods of Haberman 
(2008) and Haberman et al. (2006). Section 2 introduces the MIRT model under study, suggests 
how to compute the subscores based on MIRT, and suggests how to assess when subscores using 
MIRT have any added value over the total score and over subscores based on classical test theory. 
Section 3 illustrates application of the methods to several data sets. Section 4 provides conclusions 
based on the empirical results observed. 

Discussion in this report is confined to right-scored tests in which subscores of interest do not 
share common items. Adaptation to tests with polytomous items is straightforward. Treatment of 
subscores with overlapping items is somewhat more complicated. The authors plan to report on 
this case in a future publication. 



1 Methods From Classical Test Theory 

This section describes the approach of Haberman (2008) and Haberman et al. (2006) to 
determine whether and how to report subscores. Consider a test with $q \ge 2$ right-scored items. 

A sample of $n \ge 2$ examinees is used in analysis of the data. For examinee $i$, $1 \le i \le n$, and 
for item $j$, $1 \le j \le q$, $X_{ij}$ is 1 if the response to item $j$ is correct, and $X_{ij}$ is 0 otherwise. 

The $q$-dimensional vectors $X_i$ with coordinates $X_{ij}$, $1 \le j \le q$, are independent and identically 
distributed for examinees $i$ from 1 to $n$, and the set of possible values of $X_i$ is denoted by $T$. The 
items test $r \ge 2$ skills numbered from 1 to $r$. To each item $j$, $1 \le j \le q$, corresponds a single skill 
$v(j)$, $1 \le v(j) \le r$. It is assumed that each skill corresponds to some item. Thus, if $J(k)$ denotes 
the set of items $j$ with skill $v(j) = k$, then $J(k)$ is nonempty for $1 \le k \le r$. 

In a CTT-based analysis, examinee $i$ has total raw score 

$$S_i = \sum_{j=1}^{q} X_{ij}$$

and raw subscore 

$$S_{ik} = \sum_{j \in J(k)} X_{ij},$$

which corresponds to skill $k$. The true score corresponding to $S_i$ is the true total raw score $T_i$, and 
the true score corresponding to $S_{ik}$ is the true raw subscore $T_{ik}$. Proposed subscores are judged 
by how well they approximate the true subscores $T_{ik}$. The following subscores are considered for 
examinee $i$ and skill $k$: 

• The linear combination $U_{iks} = \alpha_{ks} + \beta_{ks} S_{ik}$ based on the raw subscore $S_{ik}$, which yields the 
minimum (denoted as $\tau_{ks}^2$) of the mean-squared error $E([T_{ik} - U_{iks}]^2)$. 

• The linear combination $U_{ikx} = \alpha_{kx} + \beta_{kx} S_i$ based on the raw total score $S_i$, which yields the 
minimum ($\tau_{kx}^2$) of the mean-squared error $E([T_{ik} - U_{ikx}]^2)$. 

• The linear combination $U_{ikc} = \alpha_{kc} + \beta_{k1c} S_i + \beta_{k2c} S_{ik}$ based on the raw subscore $S_{ik}$ and raw 
total score $S_i$, which yields the minimum ($\tau_{kc}^2$) of the mean-squared error $E([T_{ik} - U_{ikc}]^2)$. 

The subscore $U_{ikc}$ is an example of an augmented subscore (Wainer et al., 2001). We will 
often refer to the procedure by which $U_{ikc}$ is obtained as the Haberman augmentation. It is 
also possible to consider an augmented subscore $U_{ika} = \alpha_{ka} + \sum_{k'=1}^{r} \beta_{kk'a} S_{ik'}$ based on all the 
raw subscores (Wainer et al., 2001), which yields the minimum $\tau_{ka}^2$ of the mean-squared error 
$E([T_{ik} - U_{ika}]^2)$. Because this augmentation typically provides results that are very similar to 
those of Haberman augmentation, we do not provide any results for $U_{ika}$ in this paper. 

To compare the possible subscores, proportional reduction in mean-squared error (PRMSE) 
is employed. Let $\tau_{k0}^2$ be the variance of the true raw subscore $T_{ik}$, so that $\tau_{k0}^2$ is the minimum of 
$E([T_{ik} - U_{ik0}]^2)$ for the constant approximation $U_{ik0} = \alpha_{k0}$. Then $\tau_{ks}^2$, $\tau_{kx}^2$, $\tau_{kc}^2$, and $\tau_{ka}^2$ cannot 
exceed $\tau_{k0}^2$. The proportional reductions of mean-squared error for the subscores under study are 

$$\mathrm{PRMSE}_{ks} = 1 - \tau_{ks}^2/\tau_{k0}^2,$$

$$\mathrm{PRMSE}_{kx} = 1 - \tau_{kx}^2/\tau_{k0}^2,$$

$$\mathrm{PRMSE}_{kc} = 1 - \tau_{kc}^2/\tau_{k0}^2,$$

and 

$$\mathrm{PRMSE}_{ka} = 1 - \tau_{ka}^2/\tau_{k0}^2.$$

The reliability coefficient of $S_{ik}$ is $\mathrm{PRMSE}_{ks}$. Each PRMSE is between 0 and 1. Because reduced 
mean-squared error is desired, it is clearly best to have a PRMSE close to 1. It is always the case 
that $\mathrm{PRMSE}_{ks} \le \mathrm{PRMSE}_{kc}$, $\mathrm{PRMSE}_{kx} \le \mathrm{PRMSE}_{kc}$, and $\mathrm{PRMSE}_{kc} \le \mathrm{PRMSE}_{ka}$. 

Consideration of the competing interests of simplicity and accuracy suggests the following 
strategy (Haberman, 2008; Haberman et al., 2006) for skill k: 
• If $\mathrm{PRMSE}_{ks}$ is less than $\mathrm{PRMSE}_{kx}$, declare that the subscore does not provide added value 
over the total score. 

• Use $U_{ikc}$ only if $\mathrm{PRMSE}_{kc}$ is substantially larger than the maximum of $\mathrm{PRMSE}_{ks}$ and $\mathrm{PRMSE}_{kx}$. 

The first recommendation reflects the fact that the observed total score will provide more 
accurate diagnostic information than the observed subscore if $\mathrm{PRMSE}_{ks}$ is less than $\mathrm{PRMSE}_{kx}$. 
Sinharay, Haberman, and Puhan (2007) discussed the strategy in terms of reasonableness and in 
terms of compliance with professional standards. The second recommendation reflects the slight 
increase in computation when $U_{ikc}$ is employed and the challenges in explaining score augmentation 
to clients. In practice, use of $U_{iks}$ is most attractive if the raw subscore $S_{ik}$ has high reliability and 
if the correlations of the true raw subscores are not very high (Haberman, 2008; Haberman et al., 
2006). 

Haberman (2008) discussed the estimation from sample data of the proposed subscores, the 
regression coefficients, the mean-squared errors, and PRMSE coefficients. The straightforward 
computations depend only on the sample moments and correlations among the subscores and their 
reliabilities. For large samples, the decrease in PRMSE due to estimation is negligible. 
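The dependence on second moments and reliabilities can be illustrated with a minimal sketch. It assumes that the total score is the sum of the subscores, that measurement errors of different subscores are uncorrelated (the subscores share no items), and that the covariance matrix and reliabilities are supplied by the user. The function name `ctt_prmse` and its arguments are hypothetical and are not taken from Haberman (2008); the code only applies standard best-linear-predictor algebra to the quantities defined in this section.

```python
import numpy as np

def ctt_prmse(cov_sub, rel):
    """Return (PRMSE_ks, PRMSE_kx, PRMSE_kc) for every skill k.

    cov_sub: (r, r) covariance matrix of the raw subscores S_ik.
    rel:     (r,) reliabilities of the raw subscores.
    Assumes S_i is the sum of the subscores and errors are uncorrelated.
    """
    cov_sub = np.asarray(cov_sub, dtype=float)
    rel = np.asarray(rel, dtype=float)
    var_s = np.diag(cov_sub)                # Var(S_ik)
    var_t = rel * var_s                     # Var(T_ik), i.e. tau_k0^2
    var_e = var_s - var_t                   # error variance of S_ik
    var_total = cov_sub.sum()               # Var(S_i)
    cov_s_total = cov_sub.sum(axis=1)       # Cov(S_ik, S_i)
    cov_t_total = cov_s_total - var_e       # Cov(T_ik, S_i)

    prmse_s = rel.copy()                    # reliability of the subscore
    prmse_x = cov_t_total ** 2 / (var_t * var_total)
    prmse_c = np.empty_like(rel)
    for k in range(rel.size):
        # Regression of T_ik on the pair (S_i, S_ik).
        v = np.array([[var_total, cov_s_total[k]],
                      [cov_s_total[k], var_s[k]]])
        c = np.array([cov_t_total[k], var_t[k]])   # Cov(T_ik, S_ik) = Var(T_ik)
        prmse_c[k] = c @ np.linalg.solve(v, c) / var_t[k]
    return prmse_s, prmse_x, prmse_c

# Illustrative (made-up) input: two subscores with variance 4, covariance 2,
# and reliability 0.8 each.
prmse_s, prmse_x, prmse_c = ctt_prmse([[4.0, 2.0], [2.0, 4.0]], [0.8, 0.8])
```

In this made-up example the augmented value is at least as large as the other two, in line with the inequalities stated earlier in this section.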

2 The Two-Parameter Logistic (2PL) MIRT Model 

The two-parameter logistic (2PL) MIRT model employed in this report is a simple-structure 
model described in Haberman et al. (2008). The basic 2PL MIRT model under study assumes 
that an $r$-dimensional random ability vector $\theta_i$ with coordinates $\theta_{ik}$, $1 \le k \le r$, is associated with 
each examinee $i$. The pairs $(X_i, \theta_i)$, $1 \le i \le n$, are independent and identically distributed, and, 
for each examinee $i$, the response variables $X_{ij}$, $1 \le j \le q$, are conditionally independent given $\theta_i$. 
Let 

$$P(h; y) = \exp(hy)/[1 + \exp(y)]$$

for $h$ and $y$ real. 

For each item $j$, $1 \le j \le q$, the conditional probability that $X_{ij} = h$ given $\theta_i = \omega$, where $\omega$ is 
an $r$-dimensional vector of real numbers, is $P(h; a_j \omega_{v(j)} - \gamma_j)$ for an unknown item discrimination 
$a_j$ and an unknown real parameter $\gamma_j$. Provided that the discrimination $a_j$ is positive, the item 
difficulty for item $j$ is then $\gamma_j/a_j = b_j$. The conditional probability that $X_i = x$ given that $\theta_i$ is 
equal to the $r$-dimensional vector $\omega$ is then 

$$p(x|\omega) = \prod_{j=1}^{q} P(x_j; a_j \omega_{v(j)} - \gamma_j). \qquad (1)$$

If $a_j$ is constant for $j$ in $J(k)$, $1 \le k \le r$, then one has a multidimensional one-parameter logistic 
(1PL) model. 
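As a small illustration of (1), the following sketch evaluates the item response function and the conditional probability of a response pattern. The function names `logistic_prob` and `response_prob`, the zero-based skill index array `skill`, and the parameter values in the usage lines are assumptions made for illustration only; they are not part of the authors' software.

```python
import numpy as np

def logistic_prob(h, y):
    """P(h; y) = exp(h * y) / (1 + exp(y)) for a response h equal to 0 or 1."""
    return np.exp(h * y) / (1.0 + np.exp(y))

def response_prob(x, omega, a, gamma, skill):
    """p(x | omega) in (1): product over items of P(x_j; a_j * omega_v(j) - gamma_j).

    x:     (q,) vector of 0/1 responses.
    omega: (r,) ability vector.
    a, gamma: (q,) item parameters.
    skill: (q,) zero-based skill index v(j) of each item (simple structure).
    """
    x = np.asarray(x, dtype=float)
    omega = np.asarray(omega, dtype=float)
    y = np.asarray(a, dtype=float) * omega[np.asarray(skill)] - np.asarray(gamma, dtype=float)
    return float(np.prod(logistic_prob(x, y)))

# Two items, one per skill, with made-up parameters.
p = response_prob([1, 0], omega=[0.0, 0.0], a=[1.0, 1.0], gamma=[0.0, 0.0], skill=[0, 1])
```

With all parameters and abilities equal to zero as above, each item has probability 0.5 of either response, so the pattern probability is 0.25.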

In this report, the assumption is made that $\theta_i$ has a multivariate normal distribution $N(0, D)$. 
Here $0$ is the $r$-dimensional vector with all coordinates 0, and $D$ is an $r$-by-$r$ positive-definite 
symmetric matrix with elements $d_{kk'}$, $1 \le k \le r$, $1 \le k' \le r$, such that each diagonal element $d_{kk}$ 
is equal to 1, and each off-diagonal element $d_{kk'}$, $k \ne k'$, is the unknown correlation of $\theta_{ik}$ and $\theta_{ik'}$. 
The assumption that the mean of $\theta_i$ is $0$ and the variance $d_{kk}$ of each $\theta_{ik}$ is 1 is imposed to permit 
identification of the item parameters $a_j$ and $b_j$ for each item $j$ from 1 to $q$. Alternative analysis is 
possible in which other distributions of $\theta_i$ are considered (Haberman et al., 2008). 

The model parameters $a_j$ and $\gamma_j$, $1 \le j \le q$, and $d_{kk'}$, $1 \le k < k' \le r$, may be estimated 
by maximum likelihood by means of a version of the stabilized Newton-Raphson algorithm 
(Haberman, 1988) described in Haberman et al. (2008). Because calculations employ adaptive 
multivariate Gauss-Hermite integration, computational time is not excessive (Schilling & Bock, 
2005). 

The maximum-likelihood estimates $\hat{a}_j$ of $a_j$, $\hat{\gamma}_j$ of $\gamma_j$, and $\hat{D}$ of $D$ continue to estimate 
meaningful parameters even if the model does not hold because $a_j$, $\gamma_j$, and $D$ can be selected to 
minimize the expected log penalty function $E(-\log p(X_i))$ for $p(x)$, $x$ in $T$, the expected value of 
$p(x|\theta_i)$ (Gilula & Haberman, 1994, 2001; Haberman, 2007). In this fashion, $a_j$, $\gamma_j$, and $D$ can be 
regarded as the parameters that result in the best correspondence between the model and the 
actual probability distribution of the response vector $X_i$. If the model holds, then the optimal $a_j$ 
and $\gamma_j$ are the model parameters in (1), and the optimal $D$ is the covariance matrix of $\theta_i$. 

Given the general definition of the model parameters in terms of expected log penalty, the 
ability parameter $\theta_i$ can be defined and approximated even if the underlying model is not accurate 
(Haberman, 2007). To do so, let $\theta_i$ be defined as a random vector such that the conditional 
distribution of $\theta_i$ given $X_i = x$ is the same as the conditional distribution of a random vector $\theta^*$ 
given $X^* = x$ for a random vector $X^*$ with values in $T$, where $\theta^*$ has a multivariate normal distribution 
with zero mean and covariance matrix $D$ and the conditional probability that $X^* = x$ in $T$ given 
$\theta^* = \omega$ is $p(x|\omega)$. Let $\pi$ denote the density function of a multivariate normal random vector 
with mean $0$ and covariance matrix $D$, so that $p(x)$ is the integral $\int p(x|\omega)\pi(\omega)\,d\omega$. By Bayes's 
theorem, the conditional density of $\theta_i$ given $X_i = x$ has value 

$$f(\omega|x) = p(x|\omega)\pi(\omega)/p(x)$$

at the $r$-dimensional vector $\omega$. The unconditional density of $\theta_i$ at $\omega$ is then 

$$f(\omega) = E(f(\omega|X_i)).$$

If the model holds, then $f(\omega) = \pi(\omega)$. 

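The expected a posteriori (EAP) mean $\tilde{\theta}_i$ of $\theta_i$ given $X_i$ (Bock & Aitkin, 1981) is the basis 
for the analysis of subscores by multivariate item response models. This mean is $\int \omega f(\omega|X_i)\,d\omega$. 
Clearly $\tilde{\theta}_i$ has expectation $E(\theta_i) = \int \omega f(\omega)\,d\omega$. The covariance matrix of $\theta_i$ is 

$$\mathrm{Cov}(\theta_i) = \int [\omega - E(\theta_i)][\omega - E(\theta_i)]' f(\omega)\,d\omega,$$

where the prime indicates a transpose, while the approximation error $\theta_i - \tilde{\theta}_i$ has zero mean and 
covariance matrix 

$$\mathrm{Cov}(\theta_i - \tilde{\theta}_i) = E(\mathrm{Cov}(\theta_i|X_i)),$$

where 

$$\mathrm{Cov}(\theta_i|x) = \int (\omega - \tilde{\theta}_i)(\omega - \tilde{\theta}_i)' f(\omega|x)\,d\omega.$$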

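For $1 \le k \le r$, let the coordinate vector $\delta_k$ be the $r$-dimensional vector with coordinates 

$$\delta_{k'k} = \begin{cases} 1, & k' = k, \\ 0, & k' \ne k, \end{cases}$$

for $1 \le k' \le r$. The $k$th coordinate $\theta_{ik}$ of $\theta_i$ has variance $\tau_{k0\theta}^2 = \delta_k' \mathrm{Cov}(\theta_i)\delta_k$, and, for the $k$th 
coordinate $\tilde{\theta}_{ik}$ of $\tilde{\theta}_i$, the mean-squared error $\tau_{k\theta}^2$ is 

$$E([\theta_{ik} - \tilde{\theta}_{ik}]^2) = \delta_k' E(\mathrm{Cov}(\theta_i|X_i))\delta_k.$$

If the model holds, then $E(\theta_i) = 0$ and $\mathrm{Cov}(\theta_i) = D$, so that $\tau_{k0\theta}^2 = 1$. 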

For any nonzero fixed $r$-dimensional vector $c$, the reliability of $c'\tilde{\theta}_i$ is then 

$$\rho^2(c) = 1 - \frac{c'\,\mathrm{Cov}(\theta_i - \tilde{\theta}_i)\,c}{c'\,\mathrm{Cov}(\theta_i)\,c}. \qquad (2)$$

The quantity $c'\,\mathrm{Cov}(\theta_i - \tilde{\theta}_i)\,c$ in (2) is both the variance of $c'(\theta_i - \tilde{\theta}_i)$, where $c'(\theta_i - \tilde{\theta}_i)$ 
can be considered as the error in approximation of $c'\theta_i$ by $c'\tilde{\theta}_i$, and the mean-squared error from 
approximation of $c'\theta_i$ by $c'\tilde{\theta}_i$. Similarly, $c'\,\mathrm{Cov}(\theta_i)\,c$ in (2) is both the variance of $c'\theta_i$ and the 
minimum possible mean-squared error from approximation of $c'\theta_i$ by a constant. Thus, $\rho^2(c)$ has 
the form 

$$\rho^2(c) = 1 - \frac{\text{Error variance}}{\text{Total variance}},$$

which is the standard definition of reliability, and also has the form 

$$\rho^2(c) = \frac{\text{Reduction in MSE from approximation of } c'\theta_i \text{ by } c'\tilde{\theta}_i \text{ instead of by a constant}}{\text{MSE from approximation of } c'\theta_i \text{ by a constant}},$$

which is the usual form of a PRMSE (Haberman, 2008; Haberman et al., 2006). 

It follows that the PRMSE for the $k$th coordinate $\tilde{\theta}_{ik}$ of $\tilde{\theta}_i$ is 

$$\mathrm{PRMSE}_{k\theta} = \rho^2(\delta_k) = 1 - \tau_{k\theta}^2/\tau_{k0\theta}^2.$$

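In practice, $\tilde{\theta}_i$ must be approximated by 

$$\hat{\theta}_i = \int \omega \hat{f}(\omega|X_i)\,d\omega,$$

where 

$$\hat{f}(\omega|x) = \hat{p}(x|\omega)\hat{\pi}(\omega)/\hat{p}(x),$$

$$\hat{p}(x|\omega) = \prod_{j=1}^{q} P(x_j; \hat{a}_j \omega_{v(j)} - \hat{\gamma}_j),$$

$\hat{\pi}$ is the density of a multivariate normal random vector with mean $0$ and covariance matrix $\hat{D}$, 
and 

$$\hat{p}(x) = \int \hat{p}(x|\omega)\hat{\pi}(\omega)\,d\omega.$$

For large samples, the reliability for $c'\hat{\theta}_i$ is approximated by 

$$\hat{\rho}^2(c) = 1 - \frac{c'\,\widehat{\mathrm{Cov}}(\theta - \hat{\theta})\,c}{c'\,\widehat{\mathrm{Cov}}(\theta)\,c},$$

where 

$$\widehat{\mathrm{Cov}}(\theta - \hat{\theta}) = n^{-1} \sum_{i=1}^{n} \widehat{\mathrm{Cov}}(\theta_i|X_i),$$

$$\widehat{\mathrm{Cov}}(\theta_i|X_i) = \int (\omega - \hat{\theta}_i)(\omega - \hat{\theta}_i)' \hat{f}(\omega|X_i)\,d\omega,$$

$$\widehat{\mathrm{Cov}}(\theta) = \int (\omega - \bar{\theta})(\omega - \bar{\theta})' \hat{f}(\omega)\,d\omega,$$

$$\bar{\theta} = \int \omega \hat{f}(\omega)\,d\omega = n^{-1} \sum_{i=1}^{n} \hat{\theta}_i,$$

and 

$$\hat{f}(\omega) = n^{-1} \sum_{i=1}^{n} \hat{f}(\omega|X_i).$$

The estimated variance is $\hat{\tau}_{k0\theta}^2 = \delta_k' \widehat{\mathrm{Cov}}(\theta)\delta_k$, and, for the $k$th coordinate $\hat{\theta}_{ik}$ of $\hat{\theta}_i$, the estimated 
mean-squared error $\hat{\tau}_{k\theta}^2$ is $\delta_k' \widehat{\mathrm{Cov}}(\theta - \hat{\theta})\delta_k$. The estimated PRMSE is $\widehat{\mathrm{PRMSE}}_{k\theta} = 1 - \hat{\tau}_{k\theta}^2/\hat{\tau}_{k0\theta}^2$. 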
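The estimation formulas above can be sketched numerically as follows. This is a simplified illustration, not the authors' program: it uses a fixed (non-adaptive) product Gauss-Hermite grid rather than adaptive quadrature, it treats the item parameters and the correlation matrix as already estimated inputs, and the function name `eap_prmse` and its arguments are hypothetical. The estimated total variance is computed as the average posterior variance plus the variance of the EAP means, which follows from the definition of the averaged posterior density.

```python
import itertools
import numpy as np

def eap_prmse(X, a, gamma, skill, D, n_points=6):
    """EAP subscores and estimated PRMSE for a simple-structure 2PL MIRT model.

    X: (n, q) 0/1 responses; a, gamma: (q,) item parameters (taken as given);
    skill: (q,) zero-based skill index v(j); D: (r, r) correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    skill = np.asarray(skill)
    D = np.asarray(D, dtype=float)
    r = D.shape[0]

    # One-dimensional rule for the weight exp(-z^2 / 2), then a product grid.
    z, w = np.polynomial.hermite_e.hermegauss(n_points)
    grid = np.array(list(itertools.product(z, repeat=r)))            # (G, r)
    wts = np.prod(np.array(list(itertools.product(w, repeat=r))), axis=1)
    wts = wts / wts.sum()                        # weights for N(0, I)
    omega = grid @ np.linalg.cholesky(D).T       # nodes for N(0, D)

    # log p(x | omega) at every node for every examinee.
    y = a * omega[:, skill] - gamma              # (G, q)
    log_p1 = y - np.logaddexp(0.0, y)            # log P(1; y)
    log_p0 = -np.logaddexp(0.0, y)               # log P(0; y)
    loglik = X @ log_p1.T + (1.0 - X) @ log_p0.T # (n, G)

    # Posterior weights on the grid, normalized per examinee.
    post = np.exp(loglik - loglik.max(axis=1, keepdims=True)) * wts
    post /= post.sum(axis=1, keepdims=True)

    theta_hat = post @ omega                     # (n, r) EAP means
    post_var = post @ omega ** 2 - theta_hat ** 2  # diagonal of Cov(theta_i | X_i)
    mse = post_var.mean(axis=0)                  # estimated tau_{k theta}^2
    total = mse + theta_hat.var(axis=0)          # estimated tau_{k0 theta}^2
    return theta_hat, 1.0 - mse / total
```

The product grid has `n_points ** r` nodes, so this sketch is practical only for a small number of skills.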

3 Applications 

We analyzed data containing examinee responses from five tests used for educational 
certification. All of these tests report subscores operationally, and our goal here was to determine 
the best possible way to report subscores for these tests. To fit the MIRT model given by (1), 
we used a FORTRAN 95 program written by the lead author; the program uses a variation 
of the stabilized Newton-Raphson algorithm (Haberman, 1988) described in Haberman et al. 
(2008). Required quadratures are performed by adaptive multivariate Gauss-Hermite integration 
(Schilling & Bock, 2005). 

3.1 Data Sets 

The tests considered here contained only multiple-choice (MC) items and represented a broad 
range of content and skill areas such as elementary education, reading, writing, mathematics, and 
foreign languages. Results from this study may provide useful information for other tests of similar 
format and content. For confidentiality reasons, hypothetical names (Tests A through E) are used for 
the tests. The number of items in each subscore for each of the five tests is presented in Tables 1-5. A 
brief description of the tests and the operationally reported subscores for each test is presented 
below. All of these data sets were considered in Puhan, Sinharay, Haberman, and Larkin (2008). 

Test A is designed for prospective teachers of children in primary through upper elementary 
school grades. The 119 multiple-choice questions focus on four major subject areas: language 
arts/reading (30 items), mathematics (29 items), social studies (30 items), and science (30 items). 
The sample size (the number of examinees who took the form of Test A considered here) was 
31,001, and the reliability of the total test score was 0.91. 




Test B is designed for examinees who plan to teach in a special-education program at 
any grade level from preschool through grade 12. The 60 multiple-choice questions assess the 
examinee’s knowledge of three major content areas: understanding exceptionalities (13 items), 
legal and societal issues (10 items), and delivery of services to students with disabilities (33 items). 
The sample size was 7,930, and the reliability of the total test score was 0.74. 

Test C is designed to assess the knowledge and competencies necessary for a beginning or 
entry-year teacher of Spanish. This test consists of 116 MC questions organized into four broad 
categories: interpretive listening (31 items), structure of the language (35 items), interpretive 
reading (30 items), and cultural perspectives (20 items). The sample size was 2,154 and the 
reliability of the total test score was 0.94. 

Test D is designed to assess the mathematical knowledge and competencies necessary for 
a beginning teacher of secondary school mathematics. It consists of 50 MC questions arranged 
into three broad categories, namely, mathematical concepts and reasoning (17 items), ability 
to integrate knowledge of different areas of mathematics (12 items), and the ability to develop 
mathematical models of real-life situations (21 items). The sample size was 6,818, and the 
reliability of the total test score was 0.82. 

Test E is used to measure skills necessary for prospective and practicing paraprofessionals. It 
consists of 73 MC questions arranged into three broad categories: reading (25 items), mathematics 
(23 items), and writing (25 items). The sample size was 3,637, and the reliability of the total test 
score was 0.94. 

3.2 Results 

Tables 1 through 5 provide results for Tests A through E. Each of these tables shows the 
following: 

• the number of items in the subscores, 

• the estimated correlation between the raw subscores (simple and disattenuated), 

• the estimated correlation $d_{kk'}$ between the components $\theta_{ik}$ and $\theta_{ik'}$ under the model, and 

• the estimates of $\mathrm{PRMSE}_{ks}$ (the subscore reliability), $\mathrm{PRMSE}_{kx}$, $\mathrm{PRMSE}_{kc}$, and $\mathrm{PRMSE}_{k\theta}$. 

These tables do not provide the names of the subscores (they are given earlier in the Data Sets 
subsection) and only denote the subscores as Subscores 1, 2, .... Note that a comparison between 
$\mathrm{PRMSE}_{kc}$ and $\mathrm{PRMSE}_{k\theta}$ will reveal whether the MIRT approach provides subscores that, relative 
to their variability, are more accurate than those provided by the CTT approach. 
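The disattenuated correlations listed among the table entries can be related to the simple correlations by the standard correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below assumes that this standard formula is the one used for the tables; the function name is hypothetical.

```python
import math

def disattenuated_correlation(r_obs, rel_1, rel_2):
    """Correct an observed correlation for unreliability of both scores."""
    return r_obs / math.sqrt(rel_1 * rel_2)

# Rounded values for Subscores 1 and 2 of Test A (Table 1):
# simple correlation 0.59, reliabilities 0.71 and 0.83.
value = disattenuated_correlation(0.59, 0.71, 0.83)
```

With these rounded inputs the result is about 0.77, close to the tabled 0.78; a small gap of this size is what one would expect from rounding of the inputs.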



Table 1 

Results for Test A 

                                                     Subscore
                                                1      2      3      4
Length                                         30     30     29     30
Correlation between the raw subscores
  Subscore 1                                 1.00   0.59   0.58   0.59
  Subscore 2                                 0.78   1.00   0.53   0.60
  Subscore 3                                 0.80   0.68   1.00   0.64
  Subscore 4                                 0.84   0.78   0.88   1.00
Correlation between the components of $\theta_i$
  Subscore 1                                 1.00
  Subscore 2                                 0.80   1.00
  Subscore 3                                 0.84   0.71   1.00
  Subscore 4                                 0.87   0.80   0.89   1.00
$\mathrm{PRMSE}_{ks}$                        0.71   0.83   0.73   0.71
$\mathrm{PRMSE}_{kx}$                        0.77   0.74   0.75   0.82
$\mathrm{PRMSE}_{kc}$                        0.82   0.86   0.82   0.84
$\mathrm{PRMSE}_{k\theta}$                   0.84   0.87   0.85   0.87

Note. In the correlation matrix between the raw subscores, the simple correlations are shown 
above the diagonal, and the disattenuated correlations are shown below the diagonal. 



The pattern of results is quite consistent. The MIRT subscores almost always yield a PRMSE 
at least as high as those provided by the augmented subscores. The differences are often quite 
small, but they are appreciable in a number of cases. 

To investigate further the relationship between the MIRT subscores and the augmented 
subscores, Figures 1 and 2 provide, for each of the 4 subscores of Test C and for each of the 3 
subscores of Test D, (a) scatterplots of augmented subscores versus raw subscores (the panels 
in the top row), (b) the MIRT subscores versus the raw subscores (the panels in the middle row), 
and (c) the MIRT subscores versus the augmented subscores (the panels in the bottom row) for 
1,000 randomly chosen examinees. Each panel also shows the correlation between the variables 
being plotted. Results were similar for the other tests and are not shown. While the correlations 
between the raw subscores and the augmented/MIRT subscores are between 0.86 and 0.97, the 
Table 2 

Results for Test B 

                                                  Subscore
                                                1      2      3
Length                                         13     10     33
Correlation between the raw subscores
  Subscore 1                                 1.00   0.34   0.51
  Subscore 2                                 0.96   1.00   0.41
  Subscore 3                                 0.95   0.99   1.00
Correlation between the components of $\theta_i$
  Subscore 1                                 1.00
  Subscore 2                                 0.96   1.00
  Subscore 3                                 0.96   0.94   1.00
$\mathrm{PRMSE}_{ks}$                        0.46   0.28   0.63
$\mathrm{PRMSE}_{kx}$                        0.71   0.73   0.73
$\mathrm{PRMSE}_{kc}$                        0.71   0.73   0.73
$\mathrm{PRMSE}_{k\theta}$                   0.74   0.71   0.75

Note. In the correlation matrix between the raw subscores, the simple correlations are shown 
above the diagonal, and the disattenuated correlations are shown below the diagonal. 



correlations between the MIRT subscores and the augmented subscores are very close to 1. In other 
words, there is a nearly perfect linear relationship between the MIRT subscores and the augmented 
subscores. Our finding of the similarity of the MIRT subscores and the augmented subscores 
supports the finding in de la Torre and Patz (2005) of the similarity of MCMC-based MIRT 
subscores and Wainer-augmented subscores. Figure 1 shows a curvilinear relationship between 
the raw subscores and the augmented/MIRT subscores. Figure 3, which shows histograms of the 
distributions of the raw subscores, augmented subscores, and MIRT subscores for Test C, shows 
substantial negative skewness in the distribution of raw subscores (due to several examinees obtaining 
maximum possible subscores; see Note 1); this is the reason for the curvilinear relationship in Figure 1. 

The results indicate that Haberman augmentation and the MIRT results strongly dominate 
the results for estimates that are based only on raw subscores. The augmented subscores and the 
MIRT-based subscores improve on the raw subscores and the total score with respect to PRMSE 
for Tests A, C, and E. Interestingly, for Test D, the augmented subscores do not improve on the 
total score with respect to PRMSE, but the MIRT-based subscores do. For Test B, neither the 
augmented subscores nor the MIRT-based subscores lead to any improvement over the total score. 







Table 3 

Results for Test C 

                                                     Subscore
                                                1      2      3      4
Length                                         31     35     30     20
Correlation between the raw subscores
  Subscore 1                                 1.00   0.70   0.79   0.53
  Subscore 2                                 0.85   1.00   0.73   0.55
  Subscore 3                                 0.93   0.87   1.00   0.58
  Subscore 4                                 0.70   0.73   0.75   1.00
Correlation between the components of $\theta_i$
  Subscore 1                                 1.00
  Subscore 2                                 0.91   1.00
  Subscore 3                                 0.95   0.93   1.00
  Subscore 4                                 0.75   0.77   0.80   1.00
$\mathrm{PRMSE}_{ks}$                        0.84   0.83   0.86   0.68
$\mathrm{PRMSE}_{kx}$                        0.85   0.84   0.88   0.64
$\mathrm{PRMSE}_{kc}$                        0.89   0.88   0.91   0.77
$\mathrm{PRMSE}_{k\theta}$                   0.90   0.90   0.91   0.78

Note. In the correlation matrix between the raw subscores, the simple correlations are shown 
above the diagonal, and the disattenuated correlations are shown below the diagonal. 



For Test B, the subscores are too unreliable for any diagnostic score reporting. 

4 Conclusions 

The use of MIRT models to generate subscores is quite feasible, as evidenced by the examples. 
Given the similarity of results in terms of PRMSE to those from the CTT-based Haberman 
augmentation, client preferences may be a significant consideration. For clients preferring IRT 
models over CTT, this paper will provide a rational and practical approach to reporting subscores. 

Computational burden for the MIRT analysis appears acceptable: the software program 
did not take more than a couple of hours to complete the calculations for any of the data sets 
we analyzed here. Several calculation details can be modified for much larger samples. The six 
quadrature points per dimension used here were somewhat more than appears to be needed (Haberman et al., 
2008). For example, for four dimensions, a reduction from six to three points per dimension 
reduces computational labor by a factor of about 16. In addition, it is often advisable to begin 
calculations with a few hundred or few thousand observations to establish good approximations of 
Table 4 

Results for Test D 

                                                  Subscore
                                                1      2      3
Length                                         17     12     20
Correlation between the raw subscores
  Subscore 1                                 1.00   0.57   0.61
  Subscore 2                                 0.95   1.00   0.58
  Subscore 3                                 0.97   0.94   1.00
Correlation between the components of $\theta_i$
  Subscore 1                                 1.00
  Subscore 2                                 0.92   1.00
  Subscore 3                                 0.97   0.93   1.00
$\mathrm{PRMSE}_{ks}$                        0.61   0.59   0.65
$\mathrm{PRMSE}_{kx}$                        0.81   0.78   0.81
$\mathrm{PRMSE}_{kc}$                        0.81   0.79   0.81
$\mathrm{PRMSE}_{k\theta}$                   0.83   0.81   0.84

Note. In the correlation matrix between the raw subscores, the simple correlations are shown 
above the diagonal, and the disattenuated correlations are shown below the diagonal. 



maximum-likelihood estimates. The approximations would then be used to complete computations 
with the full sample. Even with improved numerical techniques, the MIRT-based approach to 
computing subscores involves a much higher computational burden than is required for the 
CTT-based approach of Haberman (2008) and Haberman et al. (2006). 
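
The factor of about 16 quoted for four dimensions can be checked with a short calculation, under the assumption that computational labor is roughly proportional to the number of nodes in a full product quadrature grid with $m$ points in each of $r$ dimensions:

```latex
\[
  \frac{m_{\text{old}}^{\,r}}{m_{\text{new}}^{\,r}}
  = \frac{6^{4}}{3^{4}}
  = \frac{1296}{81}
  = 2^{4}
  = 16 .
\]
```

Under the same assumption, the saving from using fewer points per dimension grows geometrically with the number of skills.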

Use of the MIRT-based approach results in estimates that are more difficult to explain than 
are raw scores, although this issue can be alleviated by alternative scalings. For example, the 
conditional expectation of $\theta_{ik}$ given $X_i$ could be replaced by the conditional expectation of $g_k(\theta_{ik})$ 
given $X_i$, where, for real $\omega$, $g_k(\omega)$ is the test characteristic curve 

$$g_k(\omega) = \sum_{j \in J(k)} P(1; a_j \omega - \gamma_j)$$

corresponding to $S_{ik}$, so that $g_k(\omega)$ is the conditional expectation of $S_{ik}$ given $\theta_{ik} = \omega$, and $g_k(\theta_{ik})$ 
is the true score corresponding to $S_{ik}$ if the model is valid. See ? (?)habsinpsych) for further 
details on this issue. 
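
A test characteristic curve of this kind can be sketched in a few lines. The function name `characteristic_curve` and the parameter values in the usage line are illustrative assumptions; the inputs are the discriminations and the parameters $\gamma_j$ of the items that belong to one skill.

```python
import numpy as np

def characteristic_curve(omega, a_k, gamma_k):
    """g_k(omega): sum over the items of skill k of P(1; a_j * omega - gamma_j)."""
    y = np.asarray(a_k, dtype=float) * omega - np.asarray(gamma_k, dtype=float)
    # P(1; y) = exp(y) / (1 + exp(y)) = 1 / (1 + exp(-y)).
    return float(np.sum(1.0 / (1.0 + np.exp(-y))))

# Two items with made-up parameters, evaluated at omega = 0.
expected_raw_subscore = characteristic_curve(0.0, [1.0, 2.0], [0.0, 0.0])
```

The result lies between 0 and the number of items for the skill, so it is expressed on the raw-subscore scale rather than on the scale of the ability parameter.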

MIRT-based estimates such as $\hat{\theta}_{ik}$ are not on the same scale as the raw subscores $S_{ik}$. This 
affects comparisons of mean-squared or root mean-squared errors but does not affect comparisons of 






Table 5 

Results for Test E 

                                                  Subscore
                                                1      2      3
Length                                         25     23     25
Correlation between the raw subscores
  Subscore 1                                 1.00   0.76   0.79
  Subscore 2                                 0.90   1.00   0.73
  Subscore 3                                 0.91   0.86   1.00
Correlation between the components of $\theta_i$
  Subscore 1                                 1.00
  Subscore 2                                 0.92   1.00
  Subscore 3                                 0.94   0.90   1.00
$\mathrm{PRMSE}_{ks}$                        0.87   0.84   0.85
$\mathrm{PRMSE}_{kx}$                        0.90   0.85   0.87
$\mathrm{PRMSE}_{kc}$                        0.91   0.89   0.90
$\mathrm{PRMSE}_{k\theta}$                   0.91   0.89   0.90

Note. In the correlation matrix between the raw subscores, the simple correlations are shown 
above the diagonal, and the disattenuated correlations are shown below the diagonal. 



PRMSE measures because any particular PRMSE is a dimensionless measure in which numerator 
and denominator are on the same scale. 

Subscores must be reported on some established scale. A temptation exists to make this 
scale comparable to the scale for the total score or to the fraction of the scale that corresponds 
to the relative importance of the subscore, but these choices are not without difficulties given 
that subscores and total scores typically differ in reliability. In addition, if the subscore is worth 
reporting at all, then the subscore presumably does not measure the same construct as the total 
score. Further, appropriate methods of equating or linking must be considered when determining 
whether and how to report subscores. In typical cases, equating is feasible for the total score but 
not for subscores. For example, if an anchor test is used to equate the total test, only a few of 
the items will correspond to a particular subscore, so anchor test equating of the subscore is not 
feasible. 

Figure 1. Plots of the raw subscores, augmented subscores, and MIRT subscores versus 
each other for 1,000 examinees for Test C. The correlation coefficients between the 
variables plotted are also shown. 

Figure 2. Plots of the raw subscores, augmented subscores, and MIRT subscores versus 
each other for 1,000 examinees for Test D. The correlation coefficients between the 
variables plotted are also shown. 

Figure 3. Histograms of the distributions of the raw subscores, augmented subscores, 
and MIRT subscores for Test C. The skewness of each distribution is also shown. 

Notes 



1 The augmented subscores also have negative skewness; the MIRT subscores have slightly 
positive skewness. 